We are doing this because we want to extract insight from data. We want to be able to learn something from the information we have.
This workshop—and my own learning in R—uses a number of key ideas from Hadley Wickham’s R for Data Science.1 It is a wonderfully written book with a beginner audience in mind, and goes into far greater depth than this workshop can. However, with this workshop behind you I hope you’ll be able to jump into R for Data Science with confidence. From its introduction is this (adapted) graphic:
It lays out the project of any data science work:
And, importantly, we’re here to have fun, and to make fun and exciting visuals along the way.
R is a script-based language. You write down a list of instructions and it will follow, performing one action after another. This is different to ‘point and click’ software like Microsoft Excel, and it can feel a bit cumbersome.
In Excel, you can perform a series of steps:
But there is no record of any of this. You might look back at the data in a week’s time and not know which rows you deleted. This is particularly important when you go to write up your methodology for an assignment or thesis: you might not be able to recreate your own work.
R is a script-based program. You write a list of instructions and R will follow it. This is wonderfully handy:
There are many proprietary script-based programs for analysing data: Stata, SAS, Eviews, SPSS, Matlab. I imagine you will encounter a few of these during your undergrad.
They cost $$$. If you are not at a university/workplace that has the program, you can’t use it. Even if you are at a place that has a license, you might have to use a special computer lab to use the program. Or you might have to pay for a license yourself.
Importantly, the cost means you are less likely to play around with the program in your spare time or for non-university activities. Google Sheets is a good counter-example: because it’s free, people make budgets or plans in Sheets and, even if those are simple, they learn as they go.
Proprietary programs are also centrally controlled. Their functions are written by the company, and you can only use the set of functions they provide.2
R is free. You can use it from any computer at any time. The analysis you do with R in your undergrad will be reusable when you graduate. It also means that you will be able to display the code of your analysis in a portfolio of work when you’re looking for jobs in the future.
The free-ness of R means more people use it. R is known for a thriving community, meaning you can quickly search for how to DO THE THING I WANT TO DO in R and find an answer from wonderful people on the internet. People on the R internet are wonderful, and you’ll soon feel like this:
‘Me “working independently”’ by Allison Horst, @allison_horst
R also thrives on user-written packages (collections of functions) that are available to everyone, for free. A literature review by Robert Muenchen for r4stats.com noted:3
In 2015, R added 1,357 packages…or approximately 27,642 functions. During 2015 alone, R added more functions than SAS Institute has written in its entire history.
Digression: ideally, as more and more faculties in universities around Australia make R their program-of-choice, you’ll be able to spend more time mastering a single program (and properly understanding what you are doing in that program), and less time fumbling around with a new language for each subject you take.
The main downside to free, open-source software is that there are many ways to do the same thing. If we had a dataset that looked at country population by year, and we only wanted to keep the year 2007, we could do it in five ways (or more!):4
- the which() function and the %in% operator
- the subset() function
- the filter() function from the dplyr package

This makes searching for answers a little more difficult. However, this workshop uses tidyverse syntax (which includes filter from the dplyr package in the last example above), so I advise phrasing your Google search as how to DO THE THING I WANT TO DO r tidyverse. This will remove at least a bit of the confusion.
Now that you are completely convinced that R is the best, let’s get into it.
You will need to download and install R and R Studio. (You can skip this subsection if you already have!)
R is the language and the program. Think of it as the engine that powers the things you do. You can download it for:
Once downloaded, follow the prompts to install. Restart your computer if required.
R Studio is the interface you will use R with. The technical term is an ‘integrated development environment’ (IDE) for R. Think of it as the dashboard that shows you all the things you’ve got going on in R. You can download it for:
Then, follow the prompts to install and restart if required.
R Studio is an integrated development environment (IDE) and is how we will interact with R. It looks like this:
The four panes are labelled in \(\color{green}{\text{green}}\) :
I know this can all look a bit intimidating the first time you see it. That’s okay! We’ll get to know R Studio more as we go through this course.
Good folder structure is tedious and abstract and not-at-all-fun but it makes everything in the future easier. It simply means you have:
- a main folder for each project, e.g. introduction_to_R. Your projects might be something like econometrics_assignment2 or honours_thesis. Whatever the project, everything you need for it is contained within the main project folder.
- the data you are using in a folder called data. Keep output (tables, charts, etc) in an output folder. Note that you can set these up how you like: but consistency makes it easier for you to switch between projects and collaborate with others.

Your script will often ask for things on the computer. For example, “read in this dataset” or “save this chart to a place”. For that, we have to tell the computer where things are. We can do that in two ways: 1) by explicitly setting a working directory (which is not the best way), or 2) by setting up an R Project (which is the best way to do it).
This is sometimes done by ‘setting a working directory’. This means having a line in your script that says ‘this is where we are’:
This is problematic and frustrating because your directory path won’t be the same as your collaborators’ or tutors’ (unless you have the same name and the same operating system!).
From Hadley Wickham’s R for Data Science:
But you should never [use setwd to set a working directory] because there’s a better way; a way that also puts you on the path to managing your R work like an expert.
^ That’s you! You’re already an expert.
The best way to tell your computer where your project lives is to use R Projects. An R Project is a little file that lives in your project folder with the suffix .Rproj. Opening this file opens R Studio and sets your working directory to the folder the file is in.
This is beneficial because it means you don’t have to write setwd("Users/yourname/Documents/myRfolder/this_project_of_mine") on every single script you write. It also means that your collaborators can open your project folder on their computer and all scripts will run without a hitch.
You can set up an R Project by clicking File -> New Project and following the prompts.
Objects are ‘things’ that R knows about. They can be as simple as a single number, or more complicated, like an entire dataset.
You can tell R about an object using the assign <- operator:
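For example, taking the number used in the explanation below:

```r
# Assign the number 12 to an object called 'fave_number'
fave_number <- 12
```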
We have said: take the number 12 and assign it <- to the object fave_number. R will note this down, put fave_number in our environment, and we will be able to use it later when we want. Like:
## [1] 24
Or:
## [1] 49
We can have lots of objects stored in our environment, and we can call them whenever we want (after we have defined them):
## [1] 17
A function takes inputs (aka ‘arguments’) and produces outputs.
We can use the c function to combine (concatenate) numbers into a series of numbers (a vector):
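For example, combining 3, 4 and 5 (the same numbers that appear again in the mean example below):

```r
# Combine three numbers into a vector
c(3, 4, 5)
```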
## [1] 3 4 5
The output above, like all output in this document, is preceded by ## and then [1], meaning the first line of output. Here, the output is the vector of numbers we entered into the c function.
We can also nest functions, meaning we have one function inside another function. For example, we can combine numbers into a vector using the c function, then we can take the average (mean) of the vector:
# Use the c function to combine numbers (input) into a vector (output)
# Then take the mean of that vector:
mean(c(3, 4, 5))
## [1] 4
But nested functions are a bit difficult to read: you have to start from the inside and read outwards. Alternatively, we could assign our vector to an object using the assign <- operator:
# Use the c function to combine numbers (input) into a vector (output)
# And assign that to the object 'goodnumbers'
goodnumbers <- c(3, 4, 5)
# Then take the mean of goodnumbers
mean(goodnumbers)
## [1] 4
This will make changes in our environment to the top-right: it adds the object goodnumbers.
It will also produce output in the console to the bottom-left: the mean of goodnumbers.
It should look something like this:
R uses functions to do things. A package is a collection of functions. Some packages are installed when you install R, like base and stats (this is called “base R”).
A wonderful benefit of R is being able to use the community’s collection of functions. The tidyverse is a collection of functions that make data wrangling and visualisation much easier. And we can use this package for free.
Installing a package is like installing an app on your phone or computer: you need to do it, and you only need to do it once.
You can install a package using the install.packages function. Note that a lot of text will appear in the console while a package installs. This text doesn’t make great reading, but it lays out what the package is doing as it installs. If there is an error installing, this is where you’ll find some (hopefully) useful information about why.
Now we need to load the package using the library function, like opening an app you have installed. We do this every time (every ‘session’) we want to use it.
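Putting the two steps together for the tidyverse:

```r
# Install once (note the quotation marks):
install.packages("tidyverse")

# Load at the start of each session (no quotation marks needed):
library(tidyverse)
```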
In my experience learning R, there are two things I would like to prepare you for:
R will follow your instructions to the letter. If you misspell something or put an argument where it doesn’t belong, R will try to do the thing you asked and it will fail. Hopefully, it will clearly tell you what happened; other times it will be vague.
This means you will spend a lot of time debugging your code: you think what you’ve written makes sense, but you get an error. You try to work out what’s gone wrong; you play around; you search the internet; you give up on R; you go outside and enjoy your R-free life; and then, eventually, you come back and fix it.
We have installed R and R Studio; installed and loaded packages; and we know about objects and functions.
We’re coders. And now we just have to learn how to do more:
This is a process you will continue as long as you’re using R.
The preamble is meant to get you up-and-running in R, which is necessary. But it is a little bit tedious and boring. In this part, we will go through some fun things like creating graphics. It will follow:
This uses the read_csv function and, here, we’re only going to give it one argument: the path to the csv file you want to read in quotation marks.
Tip: open quotation marks and hit tab to navigate to your file (and save you some typing).
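For the gapminder data used throughout this workshop, that looks like this (the path data/gapminder.csv assumes the folder structure set up earlier):

```r
# read_csv comes from the readr package, part of the tidyverse
library(readr)

# Read the csv file; the path goes in quotation marks
read_csv("data/gapminder.csv")
```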
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
Looks good! But what do we have?
We have a tibble: a tidy version of a column/row dataset. This means every observation is a row, and every variable is a column.
The first line of output says that we have 1,704 rows of observations \(\times\) 6 columns of variables. And from the output we can see that Afghanistan (country) had a life expectancy (lifeExp) of 28.8 in (year) 1952.
The read_csv function seems to have worked pretty well, and our output makes sense. But the output—our data—isn’t in our environment (on the top-right) yet because we didn’t assign it to anything. We assign something using <-, meaning we can call on it later.
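The assignment looks like this:

```r
# Read the data and assign it to an object called 'gapminder'
gapminder <- read_csv("data/gapminder.csv")
```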
Now it is in our environment on the top-right of our R Studio window. This means we can call on gapminder later whenever we want it.
Much like Excel, we can explore the raw gapminder dataset with our eyes.
View will open up a new tab that displays your dataset. You can scroll through it and look at each row and variable in your data.
If we just want a quick look, head will print just the first few observations. This is handy to check on things as you’re going along.
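For our data:

```r
# Print the first six rows of the gapminder dataset
head(gapminder)
```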
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
names will display the names of all variables in the dataset (and is often the answer to ‘what was that variable called again…’)
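Like so:

```r
# List the variable names in the gapminder dataset
names(gapminder)
```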
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
It is important to visualise your data to properly understand it. The why is explored more in Part 2. For now, let’s get into it:
To visualise data we first need to have questions. Our first question is one of importance:
Question 1: what is the relationship between life expectancy and income per person?
We have the data to answer this question. When we looked at the gapminder dataset, we saw that there were variables for life expectancy lifeExp, and for income per person gdpPercap, in each country in each year.
ggplot

So we can take the gapminder dataset, generate an empty plot using ggplot and fill it with points: lifeExp on the x axis, and gdpPercap on the y axis:
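A minimal version of that plot (assuming gapminder has been read in as above):

```r
library(ggplot2)

# Scatter plot of GDP per capita against life expectancy
ggplot(data = gapminder,
       mapping = aes(x = lifeExp,
                     y = gdpPercap)) +
  geom_point()
```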
What just happened there? We used the ggplot2 package: the ‘gg’ stands for the (layered) grammar of graphics.5
Our plot has a few components:
- the data: our gapminder object.
- an empty plot, generated with the ggplot function.
- a geometric object, geom_point, to plot dots.
- within geom_point, an aesthetic defined with aes that maps the x axis to lifeExp and the y axis to gdpPercap.

We can see how that plays out by creating an empty plot:

Then fill it with our data:
And then we have to explain how to map the data to the plot using aesthetics aes, which we add to the plot using + (noting that the + comes at the end of a line).
Start with an empty plot, then layer on a geometric object geom with some aesthetics aes(). We can add more layers to the same plot.
plotly

We can use the plotly package to interactively explore the scatter plot. First, install the plotly package (remembering quotation marks when we install a package):
Then load the package using library:
Then we define our normal-old-plot as an object using <-. This time we are going to map country to the label aesthetic so we can see which countries are which:
p <- ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap,
label = country)) +
geom_point()

And place the plot in the ggplotly function:
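Like so:

```r
# Make the stored plot 'p' interactive
ggplotly(p)
```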
Now we can explore the plot! Move your mouse over the plot and work out what that high income/medium life expectancy country is.
Our plots so far have been on ‘linear’ scales: they go up by the same amount for the whole scale.
But this might not always be the right scale for things that happen exponentially. For example, we might expect that income grows much faster than health does. So we decide to examine the relationship between (exponential) GDP per capita and life expectancy. We want to use a log10 scale on the y axis (where income is), so we add + the scale_y_log10 function to our plot:
ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap)) +
geom_point() +
scale_y_log10()

To add another geom to the same plot, we use the + symbol, then add our new set of instructions on a new line. A trendline is called using geom_smooth:
ggplot(data = gapminder) +
geom_point(mapping = aes(x = lifeExp,
y = gdpPercap)) +
geom_smooth(mapping = aes(x = lifeExp,
y = gdpPercap)) +
scale_y_log10()

Note: there is a whole library of geoms to explore at https://ggplot2.tidyverse.org/reference/.
The trendline helps us with our question: we can see that, overall, higher income is correlated with higher life expectancy. This is a result we would expect.
But the way we are looking at this data might be hiding some important insights. We should explore them.
Our scatter plot so far shows two aesthetics (aes): lifeExp mapped to the x axis, and gdpPercap mapped to the y axis.
Let’s look at our variable names again and explore if we can squeeze some more information out of this plot:
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
It might be interesting to see how things vary by continent. So let’s map continent to the colour aesthetic. We do this in the same way we mapped life expectancy and GDP per capita to the x and y axes:
ggplot(data = gapminder) +
geom_point(mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent)) +
geom_smooth(mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent)) +
scale_y_log10()

We have repeated ourselves a bit here. For each geom we have set the aesthetics: “this is the x axis”, “this is the y axis” and “this is the colour”. To save ourselves some typing (and ensure we’re being consistent) we can set the aesthetics in the first ggplot function. The geoms that follow will ‘inherit’ these aesthetics:
ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent)) +
geom_point() +
geom_smooth() +
scale_y_log10()

Just like we mapped colour to a country’s continent, we can map size—the size of the points—to a variable (column) in our dataset. Let’s map size to population (pop).
ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent,
size = pop)) +
geom_point() +
geom_smooth() +
scale_y_log10()

We have mapped variables to geoms. This means they will take a value (x, y, colour, etc) depending on their variable’s value (lifeExp, gdpPercap, continent, etc).
But what if we just wanted to set a rule? Say, what if we wanted the colour of all points to be blue? Or the transparency of all points to be 50% regardless of their country, lifeExp or gdpPercap?
Remember that aes values are mapping values: they map a variable to a thing, and we keep those mappings in the aes() function.
If we want to set a rule outside of an aesthetic, we do just that:
ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent,
size = pop)) +
geom_point(colour = "blue") + # colour = blue is outside of an aesthetic
scale_y_log10() +
labs(title = "Yay all of our points are blue!")

What would happen if we put colour = "blue" inside of aes? ggplot would do as it’s told: it would choose a colour for each value of the character string “blue”. As there is only one value, our chart looks like this:
ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent,
size = pop)) +
geom_point(aes(colour = "blue")) + # oh no, colour = blue is now INSIDE of aes().
scale_y_log10() +
labs(title = "Oh no! None of our points are blue")

But we got good information from colouring points by continent, so let’s not throw that away. We can instead say “set the transparency to 50%”. Transparency in the ggplot-world is alpha. So we’ll set alpha = 0.5 in the geom_point geometry.
ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap,
size = pop,
colour = continent)) +
geom_point(alpha = 0.5) + # added this to change transparency, outside of aes().
scale_y_log10() +
labs(title = "That’s some nice transparency")

facets

This is all getting a bit busy. It might be clearer to see each continent on its own separate chart. We can do this by adding a ‘facet’ to our plot: i.e. taking our chart and plotting it ‘around’ (~) another variable.
# Define chart as an object
chart <- ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent,
size = pop,
label = country)) +
geom_point(alpha = 0.5) +
scale_y_log10() +
facet_wrap(~continent) +
labs(title = "GDP per capita and life expectancy, by continent",
caption = "Source: Gapminder from https://gapminder.org")
# View the chart
chart

The code above says: take the chart we (now) know and love, and do that for each continent in our dataset, showing each separately.
A trendline might be nice, and we can do that by adding it to our chart object:
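Because chart is stored as an object, we can keep layering onto it:

```r
# Add a trendline layer to the existing chart object
chart + geom_smooth()
```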
And, like we did above, we can make this plot interactive with ggplotly:
We defined a chart—an object uncreatively called chart—and, because we’re only human, we would like to animate it into a gif to see it move over time.6
Animating a plot is easy thanks to the gganimate package. It follows the same ‘grammar of graphics’ structure as ggplot, and we just tack it on the end of our plot-making.
First we install the gganimate package, which has been (re-)built by Thomas Lin Pedersen7
Once the gganimate package is installed, we load it using library and add a few bits to our chart. (Note that it will take a minute or two to build the animation.)
library(gganimate)
an <-
chart +
transition_time(year) +
labs(title = "GDP and life expectancy in {round(frame_time, 0)}")

And we can save the animation using anim_save (this will save the last animation you created by default).
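For example (the filename below is just illustrative):

```r
# Save the most recently created animation to a gif
anim_save("gapminder_animation.gif")
```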
We have produced some nice graphics looking at the relationship between a country’s GDP per capita (per person) gdpPercap, and that country’s life expectancy lifeExp. Now we want to look deeper.
Recall our gapminder dataset:
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
It has rows and columns; observations and variables. Now, close your eyes and picture the gapminder dataset.
We want to add a new column called gdp, which is gdpPercap \(\times\) pop. So: what happened to our gapminder dataset? We made three changes: added the gdp variable, kept only the observations from 2007, and dropped the year variable. We’ll explore each of these steps slowly.
First, we created a new column called gdp.
This uses the function mutate, which works like this:
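In skeleton form (mydata and newvar here are placeholders, not real objects):

```r
# Take 'mydata' and add a column 'newvar', set to 10 for every row
mutate(mydata, newvar = 10)
```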
This means: take mydata and add a column newvar, which is 10 for every observation. See how we use one equals sign = to define something. You can read this as: newvar IS 10. (We’ll look at what two equals signs == means in the next section).
Thinking about the gapminder dataset, we could say that we wanted—for some reason—to make everyone richer with an everyone_richer variable that was the current GDP per capita gdpPercap \(\times\) 1000:
# To the 'gapminder' dataset, add a new variable called `everyone_richer'
mutate(gapminder, everyone_richer = gdpPercap * 1000)

## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap everyone_richer
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 779445.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 820853.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 853101.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 836197.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 739981.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 786113.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 978011.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 852396.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 649341.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 635341.
## # … with 1,694 more rows
Great—everyone is richer! But note that this is not stored anywhere: the dataset with the everyone_richer variable was just printed on your screen. Still, we saw it do an important thing: for each observation, it took whatever the value of gdpPercap was and multiplied that value by 1000.
To get to the thing we were trying to do (adding a gdp variable), we can use the mutate function and make sure we define the result as an object:
# To the 'gapminder' dataset, add a new variable called `gdp'
# Define this as 'gap_gdp'
gap_gdp <- mutate(gapminder, gdp = gdpPercap * pop)

As above, this will take the gapminder dataset and add a new variable gdp, which is equal to each observation’s GDP per capita multiplied by its population.
To make sure this has all worked, we can print the head of our dataset:
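Like so:

```r
# Check the first few rows of the new dataset
head(gap_gdp)
```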
## # A tibble: 6 x 7
## country continent year lifeExp pop gdpPercap gdp
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
And, like we have done many times before, we can visualise it:
# Plot 'gap_gdp' and define as 'this_plot'
this_plot <- ggplot(gap_gdp,
aes(x = lifeExp,
y = gdp,
colour = continent,
label = country)) +
geom_point() +
scale_y_log10()
# Look at 'this_plot'
this_plot

Note that here we have defined an object called this_plot containing our plot, and then displayed it by writing this_plot.
We could make it easier to explore interactively by again using ggplotly:
Next, we want to filter our dataset to keep only observations from 2007; i.e. we want to know what the state of the world was before the global financial crisis. To do this, we use the (surprise!) filter function. Our first argument is the dataset we want to do something to, and we follow that with a condition:
filter(original_data, [CONDITION])
A conditional statement is one for which some things are TRUE and some things are FALSE. A quick example of conditionals is below.
10 does NOT equal 20, so the ‘answer’ to this is FALSE:
## [1] FALSE
See how we are using two equals signs == to test whether two things are equal. You can read the above as: 10 IS EQUAL TO 20.
But 10 DOES equal 5 \(\times\) 2, so this is TRUE:
## [1] TRUE
We use two equals signs == to test whether something is true, whereas we use a single equals sign = to declare something, which is why we use it to define new variables. For example:
## [1] FALSE
The first line reads “declare (or assign) x as 10”, which stores x in our environment and doesn’t produce any output. The second line reads “x IS EQUAL TO 2”, which is not true (because we said that x was equal to 10). It produces the somewhat-aggressive output FALSE.
We can also use the does-not-equal sign != to say “this DOES NOT EQUAL that”. Below we are saying that 10 DOES NOT EQUAL 5 \(\times\) 2, which returns FALSE (because it does equal 5 \(\times\) 2):
## [1] FALSE
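A few more conditionals for flavour, since every comparison reduces to TRUE or FALSE:

```r
# Comparisons work on text as well as numbers
"cat" == "cat"            # TRUE: the strings match
10 > 3                    # TRUE
10 != 10                  # FALSE: 10 certainly equals 10

# Conditions can be combined: & means AND, | means OR
(10 == 5 * 2) & (3 < 4)   # TRUE: both sides are TRUE
(10 == 20) | (3 < 4)      # TRUE: at least one side is TRUE
```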
Anyway: we want to filter our data to only those observations for which the year of the observations IS EQUAL TO 2007, or as we have learnt: year == 2007. The filter function does this:
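Applying that to our data, and assigning the result to a new object:

```r
# Keep only the rows where year is 2007
gap_gdp07 <- filter(gap_gdp, year == 2007)
```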
And we can quickly check if that has worked by looking at the gap_gdp07 dataset, selecting $ only the year variable:
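Like this:

```r
# Print just the 'year' column of the filtered dataset
gap_gdp07$year
```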
## [1] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [15] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [29] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [43] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [57] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [71] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [85] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [99] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [113] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [127] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
## [141] 2007 2007
This reads: Take the gap_gdp07 dataset and choose $ the year column.
It looks like they’re all 2007—which is exactly what we wanted. We could also wrap that in the unique function, which removes any duplicates and only shows unique values:
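Like so:

```r
# Show each unique value of the 'year' column
unique(gap_gdp07$year)
```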
## [1] 2007
This reads: I want to look at each unique value (i.e. remove any duplicates) of the year column $ of the gap_gdp07 dataset.
Great! There’s only one unique year number in the gap_gdp07 dataset. Just what we wanted.
We can choose and drop variables in our dataset using the select function. This comes in handy when you’re working with large datasets and your poor computer only has so much memory. Since we have already filtered our dataset to only include observations from 2007, we can drop the year variable.
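Dropping is done with a minus sign in front of the variable name:

```r
# Drop the 'year' variable from the filtered dataset
gap_gdp07_noYear <- select(gap_gdp07, -year)
```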
This reads: define gap_gdp07_noYear as the gap_gdp07 dataset and negatively select (drop) the year variable.
Wonderful! We have done the three things we wanted: we have one dataset that adds the gdp variable; the next that only keeps observations from 2007; and a final one that removes the year variable.
But, creating all these obscure datasets is odd. There is a better way.
This %>% is a pipe! The pipe is an odd concept and it is wonderful. Think of the things we’ve just done, where our goal was to create a new variable, keep observations from 2007 and drop the year variable:
gap_gdp <- mutate(gapminder, gdp = gdpPercap * pop)
gap_gdp07 <- filter(gap_gdp, year == 2007)
gap_gdp07_noYear <- select(gap_gdp07, -year)

We’ve created a whole bunch of objects that we don’t really care about. We can neatly put this together with pipes %>%.
A pipe works by taking the thing behind it and making it the first argument in the function after it. So if we were simply adding 5 + 7 and then wanted to take the square root of that number, we could define an object:
And take the square root of that object:
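Sketching those two steps (the object name my_sum is just illustrative):

```r
# Define the object...
my_sum <- 5 + 7

# ...then take its square root
sqrt(my_sum)
```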
## [1] 3.464102
OR we could pipe %>% our number into the square root sqrt function. One caution: %>% binds more tightly than +, so writing 5 + 7 %>% sqrt() pipes only the 7 and returns 5 + sqrt(7):

## [1] 7.645751

To pipe the whole sum, wrap it in parentheses: (5 + 7) %>% sqrt() returns 3.464102, the same answer as above.
The pipe %>% takes the thing behind it and makes it the first argument in the next function. This is useful! Because we can do all the things we wanted to do in our three-step program in one:
gapminder07 <- gapminder %>% # Take the gapminder dataset...
mutate(gdp = gdpPercap * pop) %>% # and add a new variable 'gdp'...
filter(year == 2007) %>% # and filter to keep only observations from 2007
select(-year) # and drop the year variable

Verbally, this says: assign gapminder07 to the original gapminder dataset, but add a column called gdp, then filter to only include observations from 2007, then drop the year variable.
This ‘piping’ means we can pretty quickly filter and adjust graphs. Recall that the ggplot function needs a dataset as its first argument, from Part 1:
ggplot(data = gapminder,
mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent,
size = pop)) +
geom_point() +
geom_smooth() +
scale_y_log10()

So: data is the first argument. Whatever we pipe %>% into it will be the data argument. Which means we can use our new filtering skills before we plot something, and pipe %>% it into our ggplot:
gapminder %>%
filter(year == 2007) %>%
ggplot(mapping = aes(x = lifeExp,
y = gdpPercap,
colour = continent,
size = pop)) +
geom_point() +
geom_smooth() +
scale_y_log10()

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 80.199
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 2.6574e-05
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 80.199
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.005155
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 81.24
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 2.6574e-05
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 2.6574e-05
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 5)

Don’t panic about the wall of warnings: they come from geom_smooth. Once the data is filtered to a single year, a continent with only a couple of countries has too few points for the default loess trendline, so that smooth fails.
Using an R Markdown document means your analysis and your writing are in one place (this document is all written in R Markdown).
R Markdown is a type of file that produces documents. It will process your R code and its output, then typeset/produce your file. It uses the same wonderful math language as LaTeX, too.8
To start an R Markdown document, click File->New File->R Markdown. This will set you up and explain a few things:
Headers are defined by # (for level 1), ## (for level 2), etc:
# My main header
## A subheader
At some point throughout your university life you will need to write equations in a document.
$A = (r^{4}) / $
"Read a .csv file from the path "data/gapminder.csv"
It can be accessed for free here: https://r4ds.had.co.nz↩
Stata has made some grounds in this area, allowing .ado files written by users to be shared and used. But this is not near the levels of user-written functions in R.↩
http://www.r4stats.com/articles/popularity/ (the potential bias is indicated in its domain)↩
Adapted from https://www.r-bloggers.com/5-ways-to-subset-a-data-frame-in-r/↩
You can read Hadley Wickham’s A Layered Grammar of Graphics here. This section is a quick and incomplete summary.↩
This is not a key feature of data science (yet), but it is fun.↩
You can check out more of his work here: https://github.com/thomasp85↩
A lot of academic work is written in a typesetting language called LaTeX. R Markdown is a bit simpler to get started with and is a solid alternative in most cases.↩
Comments in your code
‘Comments’ are notes that live in your code and are preceded with #. R will ignore anything after the # symbol. If we forgot to use the # to ‘comment’ things, we would generate a bunch of errors as R tries to work out what you’re on about:
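A sketch of both cases:

```r
# This whole line is a comment, and R ignores it
x <- 10  # R runs the assignment, then ignores this note
x * 2    # everything after the # symbol is ignored

# Without its leading #, a note like the next line would be an error:
# This is where I double x
```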